February 27, 2019

Objectives

  • Substantive research interests
    • Broader question: Emergence of AfD as party and parliamentary presence - what are the effects on party competition and parliamentarism?
    • Descriptive (preliminary) question: What are the prevalent framings in speeches given by AfD parliamentarians?
    • Contagion hypothesis (diffusion): (Speakers of) other parliamentary groups may take over framings offered by AfD speakers.
    • cf. the DFG project “The populist challenge in parliament” (2019-2021, in cooperation with Christian Strecker, Marcel Lewandowsky, Jochen Müller)
  • Methodological interests
    • Validity and intersubjectivity of data-driven, “distant reading” approaches (in the eHumanities)
    • ML/AI: Annotation to gain training data for statistical learning => gold standard annotation
    • Social sciences: Traditions of coding and annotating text data: Quantitative/qualitative content analysis

Focus of the presentation

  • Combining distant and close reading is an unfulfilled promise: Software often inhibits combining both perspectives. How to implement workflows for coding and annotating textual data? The polmineR R package is presented as a potential solution.

  • Special focus: Interactive graph annotation as an approach to generate intersubjectively shared interpretations/understandings of discourse patterns.

  • Schedule:
    • Theory is code
    • The MigParl corpus
    • AfD Keywords
    • Graph annotation
    • Conclusions

Theory is code

Combining R and CWB

A design for close and distant reading

  • Why R?
    • the most common programming language in the social sciences
    • comprehensive availability of statistical methods
    • great visualisation capabilities
    • usability: RStudio as IDE
    • reproducible research: Rmarkdown notebooks
  • Why the Corpus Workbench (CWB)?
    • a classic toolset for corpus analysis
    • indexing and compression of corpora => performance
    • powerful and versatile syntax of the Corpus Query Processor (CQP)
    • permissive license (GPL)
  • NoSQL / Lucene / Elasticsearch are potential alternatives - but not for now

The PolMine Project R Packages

The core family of packages:

  • polmineR: basic vocabulary for corpus analysis

  • RcppCWB: wrapper for the Corpus Workbench (using C++/Rcpp, follow-up on rcqp-package)

  • cwbtools: tools to create and manage CWB indexed corpora

And there are a few other packages:

  • GermaParl: documents and disseminates GermaParl corpus
  • frappp: framework for parsing plenary protocols
  • annolite: light-weight fulltext display and annotation tool
  • topicanalysis: workflows with quantitative/qualitative elements for topic models
  • gradget: graph annotation widget

polmineR: Objectives

  • performance: if analysis is slow, interaction with the data will suffer

  • portability: painless installation on all major platforms

  • open source: no restrictions and inhibiting licenses

  • usability: make full use of the RStudio IDE

  • documentation: transparency of the methods implemented

  • theory is code: combine quantitative and qualitative methods

Getting started

  • Getting started with polmineR is easy: Assuming that R and RStudio are installed, polmineR can be installed as simply as follows (dependencies such as RcppCWB will be installed automatically). Type in an R session:
install.packages("polmineR")
  • Get the GermaParl corpus (corpus of plenary debates in the German Bundestag).
drat::addRepo("polmine") # add CRAN-style repository to known repos
install.packages("GermaParl") # the downloaded package includes a small sample dataset
GermaParl::germaparl_download_corpus() # get the full corpus
  • That’s it. Ready to go.
library(polmineR)
use("GermaParl") # activate the corpora in the GermaParl package, i.e. GERMAPARL

polmineR - the basic vocabulary

One of the ideas of the polmineR package is to offer a set of intuitive verbs to implement common analytical tasks:

  • create subcorpora: partition(), subset()

  • counting: hits(), count(), dispersion() (cf. size())

  • create term-document-matrices: as.TermDocumentMatrix()

  • get keywords / feature extraction: features()

  • compute cooccurrences: cooccurrences(), Cooccurrences()

  • inspect concordances: kwic()

  • recover full text: get_token_stream(), html(), read()

Metadata and partitions/subcorpora

  • This is the “good old” workflow to create partitions (i.e. subcorpora):
p <- partition("GERMAPARL", year = 2001)
m <- partition("GERMAPARL", speaker = "Merkel", regex = TRUE)
  • And there is an emerging new workflow …
am <- corpus("GERMAPARL") %>% subset(speaker == "Angela Merkel")

m <- corpus("GERMAPARL") %>% subset(grepl("Merkel", speaker)) # beware: also matches Petra Merkel!

cdu_csu <- corpus("GERMAPARL") %>%
  subset(party %in% c("CDU", "CSU")) %>%
  subset(role != "presidency")
  • You might read the code aloud as follows: “We generate a subcorpus X by taking the corpus GERMAPARL, subsetting it based on criterion Y, …”

Counting and dispersions

dt <- dispersion("GERMAPARL", query = "Flüchtlinge", s_attribute = "year")
barplot(height = dt$count, names.arg = dt$year, las = 2, ylab = "frequency")

Concordances / KWIC output

q <- '[pos = "NN"] "mit" "Migrationshintergrund"'
corpus("GERMAPARL") %>% kwic(query = q, cqp = TRUE, left = 10, right = 10)

Validating sentiment analysis

# `good` / `bad`: character vectors of positive / negative terms from the SentiWS dictionary
kwic("GERMAPARL", query = "Islam", positivelist = c(good, bad)) %>%
  highlight(lightgreen = good, orange = bad) %>%
  tooltips(setNames(SentiWS[["word"]], SentiWS[["weight"]])) %>%
  knit_print()

Full text output

  • This is how you can recover the full text of a subcorpus.
corpus("GERMAPARL") %>% # take the GERMAPARL corpus
  subset(date == "2009-11-10") %>% # create a subcorpus based on a date
  subset(speaker == "Merkel") %>% # get me the speech given by Merkel
  html(height = "250px") %>% # turn it into html
  highlight(list(yellow = c("Bundestag", "Regierung"))) # and highlight words of interest
  • Inspecting the full text can be extremely useful to evaluate topic models: This is how you would highlight the most likely terms of a topic model using polmineR:
h <- get_highlight_list(BE_lda, partition_obj = ek, no_token = 150)
h <- lapply(h, function(x) x[1:8]) # keep the eight most likely terms per topic

corpus("BE") %>%
  subset(date == "2005-04-28") %>%
  subset(grepl("Körting", speaker)) %>% 
  as.speeches(s_attribute_name = "speaker", verbose = FALSE)[[4]] %>% 
  html(height = "350px") %>%
  highlight(highlight = h)

Data

The MigParl Corpus

  • The following analysis is based on the MigParl corpus.

  • The corpus has been prepared in the MigTex Project (“Textressourcen für die Migrations- und Integrationsforschung” / “Text Resources for Migration and Integration Research”, funding: BMFSFJ)

  • Preparation of all plenary debates in Germany’s regional parliaments (2000-2018) using the “Framework for Parsing Plenary Protocols” (frappp-package)

  • Extraction of a thematic subcorpus using unsupervised learning (topic modelling)

  • Size of the MigParl corpus: 27,241,205 tokens

  • Size without interjections and contributions from the presidency: 22,837,376 tokens

  • structural annotation: id | speaker | party | role | lp | session | date | regional_state | interjection | year | agenda_item | agenda_item_type | speech | topics | harmonized_topics

As announced initially, our analytical concern is speeches given by AfD parliamentarians.

MigParl by year

AfD in MigParl - tokens

AfD in MigParl - share

MigParl - regional dispersion

So what’s in the data?

  • There is an (unsurprising) peak of debates on migration and integration affairs in 2015.

  • The total number of words spoken by AfD parliamentarians and their relative share have increased, as the AfD made it into a growing number of regional parliaments.

  • The AfD presence is stronger in the Eastern regional states, corresponding to stronger electoral results there.

AfD Keywords

Term extraction explained

  • To gain a first insight into the thematic foci and linguistic features of AfD speakers, we use the technique of term extraction.

  • The fundamental idea is to identify terms that occur more often in a corpus of interest than would be expected by chance, compared to a reference corpus. The statistical test used is a chi-square test.

  • To exemplify the flexibility of polmineR, we move beyond the analysis of single words and inspect 2- and 3-grams, considering particularly interesting sequences of part-of-speech tags.

  • What we may learn from the following three tables is that assumed features of populist style remained present once the AfD arrived in parliament: Foreigners and asylum-seekers are an object of concern (addressed in pejorative language), and we see vocabulary that indicates a critique of established parties and elites.
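The chi-square test mentioned above can be spelled out as follows (this is the standard formulation; the slides do not give the formula explicitly). For each candidate term, a 2x2 contingency table records the term's frequency versus all other tokens in the corpus of interest and in the reference corpus. With observed counts O_ij and expected counts E_ij derived from the table's marginals, the statistic is:

```latex
\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{O_{i\cdot} \, O_{\cdot j}}{N}
```

where N is the total number of tokens across both corpora. Terms are ranked by the statistic; large values indicate frequencies in the corpus of interest that deviate strongly from chance expectation.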

Term extraction I

Term extraction II (ADJA - NN)

Term extraction III (NN-ART-NN)

Graph Annotation

The elusive merit of cooccurrence graphs

  • Cooccurrence graphs are an eye-catcher and have become a popular analytical approach in the eHumanities (see the following examples).

  • The visualisations are very suggestive and seem to be a great condensation of ideas we have about discourse.

  • But are these interpretations sound and do they meet standards of intersubjectivity?

  • To begin with, I will show that there are many contestable choices behind these visualisations.

  • The solution I suggest is to work with three-dimensional, interactive graph visualisations that can be annotated (called gradgets, for graph annotation widgets).

Ego-Networks

Leipzig Corpus Miner (LCM)

“Wuchern der Rhizome” (“Proliferation of the rhizomes”)

(Joachim Scharloth)


polmineR & cooccurrences

  • The polmineR package offers the functionality to get the cooccurrences for a specific query of interest. The default method for calculating cooccurrences is the log-likelihood test.

  • The cooccurrences()-method can be applied to subcorpora / partitions, and corpora.

cooccurrences("GERMAPARL", query = 'Islam', left = 10, right = 10)
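The log-likelihood test used here is, in its standard formulation (Dunning's G² statistic), based on the same kind of 2x2 contingency table as the chi-square test, counting occurrences of a candidate token inside and outside the window around the query:

```latex
G^2 = 2 \sum_{i=1}^{2}\sum_{j=1}^{2} O_{ij} \, \ln\frac{O_{ij}}{E_{ij}}
```

As the statistic is asymptotically chi-square distributed with one degree of freedom, values above 3.84 are significant at p < 0.05, and values above 10.83 at p < 0.001.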

Getting all cooccurrences

Starting with polmineR v0.7.9.11, the package includes a method to efficiently calculate all cooccurrences in a corpus.

m <- partition("GERMAPARL", year = 2008, speaker = "Angela Merkel", interjection = FALSE)
drop <- terms(m, p_attribute = "word") %>% noise() %>% unlist()
Cooccurrences(m, p_attribute = "word", left = 5L, right = 5L, stoplist = drop) %>% 
  decode() %>% # decode token ids to word strings
  ll() %>% # compute the log-likelihood statistic
  subset(ll >= 10.83) %>% # keep significant pairs (p < 0.001, df = 1)
  subset(ab_count >= 5) -> coocs # require at least five joint occurrences

AfD Cooccurrences

Graph visualisation (2D, N = 100)

Graph visualisation (2D, N = 250)

Graph visualisation (2D, N = 400)

Where we stand

  • The graph layout depends heavily on filter decisions.

  • Filtering is necessary, but filter decisions are difficult to justify.

  • Graph visualisation offers many possibilities to convey extra information, but at the peril of information overload.

  • If we try to omit filter decisions, we run into the problem of overwhelming complexity of large graphs.

  • How to handle the complexity and create the foundations for intersubjectivity?

Graph visualisation (3D)

So ‘gradgets’ are the solution suggested here. The links to the following three gradgets offer a visualisation that is interactive in a double sense:

  1. You can rotate the visualisation in three-dimensional space
  2. You can click on the edges and nodes, get the concordances that are behind the statistical evaluation, and leave an annotation.

In a real-world workflow, the result of the graph annotation exercise can be stored and put into an online appendix to a publication that explains interpretative results.

So these are the gradgets:

Conclusions

Conclusions

The results of this research are very preliminary:

  • There is a (somewhat surprising) explicit politeness of AfD speakers.

  • AfD speakers are by no means talking in isolation: There is a lot of interaction with other parties (and visitors!).

  • There are antagonisms: “Wir” (“we”, the AfD / the AfD parliamentary group) versus the others.

  • It’s the economy: A redistributive logic is introduced as a leitmotif.

But in a way, AfD speeches served only as a case for exploring how we might develop the idea of “visual hermeneutics” (Gary Schaal): If we decide to work with cooccurrence graphs, graph annotation is the approach suggested here to realise the idea of combining distant and close reading, and to achieve intersubjectivity.